128 research outputs found

    Text-mining and information-retrieval services for molecular biology

    Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators.

    A sentence sliding window approach to extract protein annotations from biomedical articles

    From A critical assessment of text mining methods in molecular biology. [Background] Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed in recent years. Nevertheless, there is still a great need for comparative assessment of the performance of the proposed methods and for the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts, including tools that recognize named entities such as genes and proteins, and tools that automatically extract protein annotations. [Results] The "sentence sliding window" approach proposed here was found to efficiently extract text fragments containing annotations on proteins from full text articles, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the number of correct extractions of the complete annotations (protein-function relations). [Conclusion] We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. Combining our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction for future biomedical text mining applications. This work was sponsored by DOC, the doctoral scholarship programme of the Austrian Academy of Sciences, and the ORIEL (IST-2001-32688) and TEMBLOR (QLRT-2001-00015) projects. Peer reviewed.

    The Markyt visualisation, prediction and benchmark platform for chemical and gene entity recognition at BioCreative/CHEMDNER challenge

    Biomedical text mining methods and technologies have improved significantly in the last decade. Considerable efforts have been invested in understanding the main challenges of biomedical literature retrieval and extraction and in proposing solutions to problems of practical interest. Most notably, community-oriented initiatives such as the BioCreative challenge have enabled controlled environments for the comparison of automatic systems while pursuing practical biomedical tasks. Under this scenario, the present work describes the Markyt Web-based document curation platform, which has been implemented to support the visualisation, prediction and benchmarking of chemical and gene mention annotations at the BioCreative/CHEMDNER challenge. Creating this platform is an important step for the systematic and public evaluation of automatic prediction systems and the reusability of the knowledge compiled for the challenge. Markyt was not only critical in supporting the manual annotation and annotation revision process but also facilitated the comparative visualisation of automated results against the manually generated Gold Standard annotations and the comparative assessment of generated results. We expect that future biomedical text mining challenges and the text mining community may benefit from the Markyt platform to better explore and interpret annotations and improve automatic system predictions. Database URL: http://www.markyt.org, https://github.com/sing-group/Markyt. This work was partially funded by the [14VI05] Contract-Programme from the University of Vigo and the Agrupamento INBIOMED from DXPCTSUG-FEDER unha maneira de facer Europa (2012/273) as well as by the Foundation for Applied Medical Research, University of Navarra (Pamplona, Spain). The research leading to these results has received funding from the European Union's Seventh Framework Programme FP7/REGPOT-2012-2013.1 under grant agreement no 316265, BIOCAPS.

    Overview of the protein-protein interaction annotation extraction task of BioCreative II

    © 2008 Krallinger et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License.

    ExTRI: Extraction of transcription regulation interactions from literature

    The regulation of gene transcription by transcription factors is a fundamental biological process, yet the relations between transcription factors (TF) and their target genes (TG) are still only sparsely covered in databases. Text-mining tools can offer broad and complementary solutions to help locate and extract mentions of these biological relationships in articles. We have generated ExTRI, a knowledge graph of TF-TG relationships, by applying a high recall text-mining pipeline to MedLine abstracts, identifying over 100,000 candidate sentences with TF-TG relations. Validation procedures indicated that about half of the candidate sentences contain true TF-TG relationships. Post-processing identified 53,000 high confidence sentences containing TF-TG relationships, with a cross-validation F1-score close to 75%. The resulting collection of TF-TG relationships covers 80% of the relations annotated in existing databases. It adds 11,000 other potential interactions, including relationships for ~100 TFs currently not in public TF-TG relation databases. The high confidence abstract sentences contribute 25,000 literature references not available from other resources and offer a wealth of direct pointers to functional aspects of the TF-TG interactions. Our compiled resource, encompassing ExTRI together with publicly available resources, delivers literature-derived TF-TG interactions for more than 900 of the 1500–1600 proteins considered to function as specific DNA binding TFs. The resulting resource can be used by curators, for network analysis and modelling, and for causal reasoning or knowledge graph mining approaches, or can serve to benchmark text mining strategies. We thank the participants of the COST Action GREEKC (CA15205) for fruitful discussions during workshops supported by COST (European Cooperation in Science and Technology). Peer Reviewed. Postprint (published version).

    Evaluation of BioCreAtIvE assessment of task 2

    [Background] Molecular biology has accumulated substantial amounts of data concerning the functions of genes and proteins. Information relating to functional descriptions is generally extracted manually from textual data and stored in biological databases to build up annotations for large collections of gene products. Those annotation databases are crucial for the interpretation of large scale analysis approaches using bioinformatics or experimental techniques. Due to the growing accumulation of functional descriptions in biomedical literature, the need for text mining tools to facilitate the extraction of such annotations is urgent. In order to make text mining tools useable in real world scenarios, for instance to assist database curators during annotation of protein function, comparisons and evaluations of different approaches on full text articles are needed. [Results] The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest is a community wide competition aiming to evaluate different strategies for text mining tools, as applied to biomedical literature. We report on task 2, which addressed the automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text articles. The predictions of task 2 are based on triplets of protein – GO term – article passage. The annotation-relevant text passages were returned by the participants and evaluated by expert curators of the GO annotation (GOA) team at the European Bioinformatics Institute (EBI). Each participant could submit up to three results for each sub-task comprising task 2. In total, more than 15,000 individual results were provided by the participants. In addition to the annotation itself, the curators evaluated whether the protein and the GO term were correctly predicted and traceable through the submitted text fragment. [Conclusion] Concepts provided by GO are currently the most extensive set of terms used for annotating gene products, so they were used to assess how effectively text mining tools are able to extract those annotations automatically. Although the obtained results are promising, they are still far from reaching the performance demanded by real world applications. Among the principal difficulties encountered in addressing the proposed task were the complex nature of the GO terms and protein names (the large range of variants which are used to express proteins and especially GO terms in free text), and the lack of a standard training set. A range of very different strategies were used to tackle this task. The dataset generated in line with the BioCreative challenge is publicly available and will allow new possibilities for training information extraction methods in the domain of molecular biology.

    Biocuration Workflow Catalogue

    As the first phase of a knowledge engineering study of biocuration workflows, we performed a preliminary task-modeling exercise on seven separate bioinformatics systems. This involved constructing UML activity diagrams from detailed interviews with curators in order to understand the organization of the process the biocurators used to populate their system. The objective of this work was to identify common patterns within the workflows where we might apply text mining methods to accelerate curation. We compiled a number of workflows in a common format but were largely unable to consolidate them into a formal structure that facilitated comparison across workflows. We presented this work as a slideshow and publish this account of the catalog as supplementary information.

    Construction of medical terminological resources for Spanish: the CUTEXT term extraction system and biomedical term repositories

    The heavy use of medical terms motivated the construction of large terminological resources for English, such as the Unified Medical Language System (UMLS) or the Open Biological and Biomedical Ontology (OBO) ontologies. Purely manual construction of terminological resources is very valuable in itself, but it (1) is a highly time-consuming process, (2) does not guarantee that the included concepts or terms actually align with medical language and the terms used in clinical documents by healthcare professionals, and (3) requires constant updating and revision as biomedical concepts change and new ones emerge over time. In this paper we present a multilingual medical term extraction tool, called CUTEXT (Cvalue Used To Extract Terms), a resource promoted by the Spanish National Plan for the Advancement of Language Technology (Villegas et al., 2017), available at: https://github.com/Med-TL/Plan-TL/tree/master/CUTEXT. This work was funded by the Encomienda MINETAD-CNIO/OTG Sanidad Plan TL and the H2020 project OpenMinted (654021).